In North Carolina, records of juvenile arrests exist in a complex space between public access and legal shielding. With the “Raise the Age” reform in effect, 16- and 17-year-old defendants in specific felony cases are processed as adults in the judicial system, resulting in the release of their identifying information. However, for numerous other young individuals, such particulars are kept private. Consequently, the public is privy to a fragmented representation: age fields are inexplicably redacted, charges are listed devoid of context, and locations suggest broader trends. These disparate details not only raise significant questions regarding privacy, but also challenge our understanding of safety, time, and space within a small town such as Chapel Hill.
Prompted by this complexity, our group focused on a dataset of 37,310 arrests made by the Chapel Hill Police Department from 2010 to 2024. From this, we developed two central questions. First, can we predict the hour of the day at which arrests are most likely within particular Chapel Hill zip codes, given season, semester status, arrest type, and demographic factors? This question matters because understanding the temporal and spatial distribution of arrests could help the police department deploy resources more effectively, for instance by positioning officers near nightlife districts on late weekend evenings or around the university campus during stressful periods such as exam weeks. It would also give residents and students a clearer picture of community patterns and higher-risk periods, helping them go about their daily routines more safely.
Second, do clusters of arrest records with redacted age information show patterns consistent with juvenile protections, and what social or spatial factors (e.g., proximity to schools or patrol routes) explain these clusters? This question sheds light on the real-world effects of juvenile justice laws. Redactions protect the identities of juveniles, but they may also obscure where and why juvenile arrests occur. Identifying these clusters may help schools, families, and policymakers understand policing in areas with a high youth population, as well as the effects of reforms such as “Raise the Age.”
Collectively, these inquiries turn arrest records from inert data into active narratives that can inform public policy, support more intelligent policing, and inspire deeper dialogue about justice, privacy, and place within Chapel Hill.
The data used in this project come from the Chapel Hill Police Department’s arrest logs, which are publicly available through the data.gov website. Although we retrieved the data from this open-data platform, it is collected and maintained by the Chapel Hill Police Department. Each observation in the dataset represents an individual arrest event, with information about when and where it occurred, details of the arrest, and key demographic characteristics of the arrested individual. This dataset is not a random sample but rather a semi-comprehensive record of arrest incidents in Chapel Hill from 2010 to 2024, with the exception of several months in 2021 (we have not received a response from the CHPD database manager explaining this gap). After cleaning and filtering, our working dataset contains 37,310 observations, each corresponding to a single arrest. The following table shows sample rows for the most important variables in our data:
| Arrest_Date | Street | Arrest_Type | Drugs_Alcohol | Age | Gender | Race | Disposition | Latitude | Longitude |
|---|---|---|---|---|---|---|---|---|---|
| 2014-03-08 16:18:00 | SEDGEFIELD DR @ FOXWOOD DR | ON VIEW | N | 24 | M | W | CLEARED BY ARREST | 35.95 | -79.02 |
| 2013-02-23 19:32:00 | ROSEMARY @ HENDERSON | TAKEN INTO CUSTODY (WARRANT/LP) | Y | 40 | M | NA | ARREST/NO INVESTIGATION | 35.92 | -79.05 |
| 2012-05-06 00:48:00 | 1201 MARTIN LUTHER KING JR BLVD | ON VIEW | N | 22 | M | W | CLEARED BY ARREST | 35.95 | -79.06 |
| 2018-06-18 04:17:00 | 325 W ROSEMARY ST | SUMMONED/CITED | U | 45 | M | B | CLEARED BY ARREST | 35.91 | -79.06 |
| 2017-05-06 05:20:00 | I40 EXIT 266 | TAKEN INTO CUSTODY (WARRANT/LP) | Y | 21 | M | B | CLEARED BY ARREST | 35.97 | -79.06 |
After doing an exploratory data analysis, we found two trends to investigate further. Firstly, we observed a wide and bimodal distribution of arrest times throughout the day, with distinct peaks around midnight that showed patterned behavior. This pattern varied by day of the week, by location, and by the nature of the arrest itself. These patterns motivated us to build models that predict the “Hour of Arrest” based on contextual variables.
The second trend we noticed was a sizable number of arrest records missing age data. Many of the arrests with missing age data were also missing other demographics, such as race, gender, and ethnicity. These arrests were not evenly distributed across Chapel Hill but instead clustered in specific geographic areas. In particular, the police headquarters and East Chapel Hill High School had over 110 such arrests each. From this trend we hypothesized that the arrests with unknown ages were those of minors, and that the identifying information was redacted to protect them. The figure below shows the geographic distribution of arrests with unknown age; larger circles represent more arrests at a location.
To support our analysis, we engineered several variables from the original data. From the “Arrest Date”, we extracted the “Hour of Arrest”, “Day of the Week”, “Month”, “Season”, and “Academic Semester” (Spring, Summer, Fall, or Break), based on the University of North Carolina at Chapel Hill’s (UNC-CH) academic calendar. A binary indicator (“Franklin”) was created to identify whether the arrest occurred on Franklin Street, a busy street in downtown Chapel Hill with noticeably higher arrest activity. We also included variables for “Zip Code”, “Latitude”, “Longitude”, and demographics such as “Age”, “Gender”, and “Race”. Underage status was determined by identifying rows where “Age” was missing, under our working assumption that these records correspond to individuals under 18 whose age was withheld. “Disposition” represents the outcome of an arrest and was used in our analysis of Question 2. Variables unrelated to our explorations (such as “Arrest ID”) were excluded.
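The feature extraction described above can be sketched in a few lines. This is an illustrative Python sketch, not the code used in the project: the function name `engineer_features`, the month-based season boundaries, and the substring test for Franklin Street are all assumptions. The “Academic Semester” variable is left out because it depends on UNC-CH calendar dates that are not listed here.

```python
from datetime import datetime

# Assumption: meteorological seasons by month; the report does not define its boundaries.
SEASON_BY_MONTH = {
    12: "Winter", 1: "Winter", 2: "Winter",
    3: "Spring", 4: "Spring", 5: "Spring",
    6: "Summer", 7: "Summer", 8: "Summer",
    9: "Fall", 10: "Fall", 11: "Fall",
}
DAY_NAMES = ["Monday", "Tuesday", "Wednesday", "Thursday",
             "Friday", "Saturday", "Sunday"]


def engineer_features(arrest_date, street, age):
    """Derive time and location features from one raw arrest record."""
    ts = datetime.strptime(arrest_date, "%Y-%m-%d %H:%M:%S")
    return {
        "hour": ts.hour,
        "day_of_week": DAY_NAMES[ts.weekday()],  # weekday(): Monday == 0
        "month": ts.month,
        "season": SEASON_BY_MONTH[ts.month],
        # Assumption: a simple substring match flags Franklin Street arrests.
        "franklin": "FRANKLIN" in street.upper(),
        # Missing age is the proxy for underage status.
        "underage": age is None,
    }


# First sample row from the table above.
row = engineer_features("2014-03-08 16:18:00", "SEDGEFIELD DR @ FOXWOOD DR", 24)
# Hypothetical record with a Franklin Street address and a withheld age.
redacted = engineer_features("2012-05-06 00:48:00", "100 E FRANKLIN ST", None)
```

Keeping the derivation in one function makes it easy to apply the same rules to every row and to swap in different season or street definitions later.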
To answer the first question, we developed and compared five models to predict the hour of arrest, built from three modeling approaches: linear regression, K-Nearest Neighbors (KNN), and Random Forest, the last of which we fit in three variants (Simple, Base, and Full). Linear regression served as a baseline for model comparison. Because the relationships among our variables are nonlinear, we did not expect linear regression to predict the hour accurately. Second, we implemented a KNN model, which predicts the arrest hour as the average of the most similar observations. KNN models handle nonlinearity well, but they can struggle with imbalanced data. Finally, we applied Random Forest models. A Random Forest builds many individual decision trees and combines their results to make more accurate and stable predictions. Each tree is trained on a random bootstrap sample of the data, so no two trees see exactly the same observations. We used 100 trees in our models to reduce their sensitivity to noise introduced by the number of variables being analyzed. We chose this model for our data in particular because it can handle a large number of variables of different types and tends to tolerate additional predictors well, which is not always true of other models. The variables in each model were selected based on their observed relevance in our exploratory analysis. The graphs below show each model’s predictions compared to the actual hour.
All models were evaluated using Mean Absolute Error (MAE) and Root Mean Square Error (RMSE) to assess their predictive accuracy.
| Model | MAE | RMSE |
|---|---|---|
| Random Forest - Full | 2.747056 | 3.577422 |
| Random Forest - Base | 3.400056 | 4.289152 |
| KNN | 5.799254 | 6.940523 |
| Random Forest - Simple | 6.040784 | 6.992252 |
| Linear | 6.779104 | 7.705276 |
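For readers unfamiliar with the two metrics in the table above, the following Python sketch shows how each is computed. The hours in `actual` and `predicted` are made-up values for illustration only, not data from the project.

```python
import math


def mae(actual, predicted):
    """Mean Absolute Error: the average size of the errors."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def rmse(actual, predicted):
    """Root Mean Square Error: like MAE, but large errors count more."""
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))


# Made-up hours: three small errors and one four-hour miss.
actual = [23, 0, 12, 14]
predicted = [22, 1, 12, 18]

mae_value = mae(actual, predicted)
rmse_value = rmse(actual, predicted)
```

Because RMSE squares each error before averaging, a single large miss raises it more than it raises MAE, so the two metrics can rank models differently.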
The best-performing model was Random Forest - Full, which suggests that the Random Forest benefits from additional predictors. Every model tends to over-predict after midnight and under-predict before midnight. A glance at the residuals (actual - predicted hour) confirms this:
While predictions were relatively accurate around midday, model performance declined markedly during the late-night hours. This stems from the linear encoding of our Hour variable, which treats hour 0 (12 AM) and hour 23 (11 PM) as 23 hours apart even though they are only one hour apart in reality. Time of day is inherently circular, not linear. To address this, we transformed the Hour variable using sine and cosine functions to capture its circular nature. This places each hour on the unit circle, preserving its cyclical structure.
\[ \sin\left(\frac{2\pi \cdot \text{Hour}}{24}\right), \quad \cos\left(\frac{2\pi \cdot \text{Hour}}{24}\right) \]
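A short Python sketch of this encoding illustrates why it helps. The function name `encode_hour` is an assumption made for illustration; the formula is the one given above.

```python
import math


def encode_hour(hour):
    """Map an hour in [0, 24) to a (sine, cosine) point on the unit circle."""
    angle = 2 * math.pi * hour / 24
    return math.sin(angle), math.cos(angle)


# Straight-line distance between encoded hours.
late_vs_midnight = math.dist(encode_hour(23), encode_hour(0))
one_vs_midnight = math.dist(encode_hour(1), encode_hour(0))
noon_vs_midnight = math.dist(encode_hour(12), encode_hour(0))
```

After encoding, 11 PM is exactly as close to midnight as 1 AM is, and noon is the farthest point from midnight, which matches how time of day actually behaves.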
We then trained two new Random Forest models on these transformed values: one to predict the sine of the hour, and another to predict the cosine of the hour. We used the same variables as Random Forest - Full so that the models had the most information available for making predictions.
Once both models generated predictions for the sine and cosine values, we reconstructed the predicted hour using the two-argument arctangent function (atan2), which uses the signs of both components to map the pair back to the correct angle on the unit circle.
\[ \text{Hour} = \left( \frac{\operatorname{atan2}(\text{Sine Hour}, \text{Cosine Hour}) \cdot 24}{2\pi} \right) \bmod 24 \]
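The reconstruction step can be sketched in Python as follows. The function name `decode_hour` is an assumption for illustration, and `math.atan2` stands in for whatever two-argument arctangent the project actually used.

```python
import math


def decode_hour(sin_value, cos_value):
    """Recover an hour in [0, 24) from predicted sine and cosine values."""
    angle = math.atan2(sin_value, cos_value)  # angle in (-pi, pi]
    return (angle * 24 / (2 * math.pi)) % 24  # wrap negative angles forward


def circular_gap(a, b, period=24):
    """Shortest distance between two hours on the clock."""
    diff = abs(a - b) % period
    return min(diff, period - diff)


# Round trip: encode every whole hour, then decode it again.
round_trip_gaps = []
for hour in range(24):
    angle = 2 * math.pi * hour / 24
    recovered = decode_hour(math.sin(angle), math.cos(angle))
    round_trip_gaps.append(circular_gap(hour, recovered))

# Predicted pairs need not lie exactly on the unit circle; only the direction matters.
off_circle = decode_hour(0.5, 0.5)
```

Because the sine and cosine come from two separate models, their predictions will rarely satisfy sin² + cos² = 1; atan2 depends only on the direction of the point, so no renormalization is needed before decoding.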
The result is the model below:
This model handles times around midnight noticeably better, although it has not completely eliminated the over- and under-prediction. To evaluate how redefining time affected our model, we compared its error to that of the previous best model, which used a linear time representation.
| Model | MAE | RMSE |
|---|---|---|
| Random Forest - Circular | 2.338836 | 6.153505 |
| Random Forest - Full | 2.747056 | 3.577422 |
The circular model lowers MAE (2.34 vs. 2.75) but has a higher RMSE (6.15 vs. 3.58), which points to a smaller typical error alongside a small number of very large ones. To further evaluate prediction quality, we visualized residual distributions. In models using linear time, residuals showed clear patterns near the edges of the clock. After applying the circular time model, the residuals were more evenly distributed, indicating a better fit across the entire 24-hour cycle.
The residuals now follow a sinusoidal pattern centered around zero, rather than a linear trend. Notably, there are clusters of extreme under-predictions and over-predictions near midnight. These occur because predictions close to midnight must be converted back to a 24-hour scale for visualization, which creates the illusion of large errors. In a circular representation of time, these values are actually quite close to the true values, so these outliers overstate the model's real error and likely contribute to its higher RMSE.
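One way to keep this wraparound from distorting the error is to measure each error as the shortest distance around the clock. The Python sketch below illustrates the idea; it is not necessarily how the errors in this report were computed, and the function names and example hours are assumptions.

```python
def circular_error(actual, predicted, period=24):
    """Shortest distance between two hours, going either way around the clock."""
    diff = abs(actual - predicted) % period
    return min(diff, period - diff)


def circular_mae(actual, predicted, period=24):
    """Mean of the circular errors over paired actual and predicted hours."""
    errors = [circular_error(a, p, period) for a, p in zip(actual, predicted)]
    return sum(errors) / len(errors)


# Made-up example: an 11:30 PM arrest predicted at 12:30 AM.
linear_miss = abs(23.5 - 0.5)
circular_miss = circular_error(23.5, 0.5)

example_mae = circular_mae([23.5, 12.0, 1.0], [0.5, 14.0, 23.0])
```

In this example the linear error is 23 hours while the circular error is 1 hour, which is the kind of gap that can inflate RMSE without reflecting a poor prediction.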
The resulting model is a relatively accurate prediction of the hour of day an arrest will occur based on the factors of that arrest. This model has practical value for both the Chapel Hill Police Department and the UNC-CH student body. For the police, models like this can inform resource allocation around the town, enabling officers to be strategically positioned during high-risk hours. Arrests and crimes with similar characteristics can be modeled to windows of time so that officers know what to watch out for at different times of day. For students, this information can inform risk-taking behaviors, influence safer daily routines, and increase awareness about hours requiring increased vigilance. To make this model more useful, future work can incorporate additional variables such as campus events, holidays, patrol routes, etc. that may impact arrest patterns. Before this model can be used in the real world, it is essential that it be evaluated for fairness across demographic groups to ensure it doesn’t reinforce harmful biases.
To explore age-based differences in arrest processing, we analyzed arrest records from Chapel Hill between 2010 and 2023 using descriptive statistics and data visualizations in R. We created a binary variable for underage status and grouped the data by “Arrest_Type” and “Disposition.” For each group, we calculated both raw counts and proportions, and visualized the results using bar plots with overlaid labels. This approach allowed us to clearly compare processing trends between minors and adults, helping to uncover systematic differences in how each group is treated by law enforcement.
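The grouping step described above was carried out in R; the Python sketch below only illustrates the same calculation, using a handful of made-up records. The function name `proportions_by_group` and the dictionary keys are assumptions.

```python
from collections import Counter


def proportions_by_group(records, group_key, category_key):
    """Return counts and within-group proportions for each (group, category) pair."""
    counts = Counter((r[group_key], r[category_key]) for r in records)
    totals = Counter(r[group_key] for r in records)
    proportions = {pair: n / totals[pair[0]] for pair, n in counts.items()}
    return counts, proportions


# Made-up records for illustration only; these are not rows from the dataset.
records = [
    {"underage": True, "arrest_type": "SUMMONED/CITED"},
    {"underage": True, "arrest_type": "ON VIEW"},
    {"underage": False, "arrest_type": "ON VIEW"},
    {"underage": False, "arrest_type": "ON VIEW"},
    {"underage": False, "arrest_type": "SUMMONED/CITED"},
    {"underage": False, "arrest_type": "TAKEN INTO CUSTODY (WARRANT/LP)"},
]

counts, proportions = proportions_by_group(records, "underage", "arrest_type")
```

Dividing by the group total rather than the overall total is what makes the two groups comparable, since the underage group is far smaller than the adult group.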
Our analysis revealed that underage individuals were more likely to be Summoned/Cited (45.2%) compared to non-underage individuals (34.9%). In contrast, non-underage individuals were more often Taken into Custody via Warrant (27.4%) than underage individuals (18.7%). The proportion of On View arrests was nearly identical between both groups. These differences suggest that minors are less likely to be physically detained, possibly due to legal protections or procedural considerations aimed at youth populations.
These results may reflect practical or legal considerations when handling minors. Police may prefer issuing citations rather than physical detainment for youth, or legal requirements may limit the use of warrants for minors.
For dispositions, we found that underage individuals were more likely to have their cases formally resolved. Of 1,843 underage arrests, 95.4% were cleared by arrest, compared to 92.4% of 35,317 non-underage arrests. Fewer underage arrests resulted in “Arrest/No Investigation” (4.5% vs. 7.4%). This suggests juvenile cases may follow more formal or legally required processing.
These findings suggest that minors in Chapel Hill are processed through more uniform and formal legal channels, receiving citations more often and having their cases cleared at higher rates, likely due to protective policies around youth. This pattern creates an opportunity to assess whether current adult arrest procedures are overly variable or inconsistent. For policymakers, this insight could justify a review of discretionary arrest practices, especially for non-violent offenses, with an eye toward expanding structured alternatives like citations or community-based interventions. The consistency in how underage arrests are handled suggests a successful model of structured processing. Law enforcement could pilot more standardized procedures for non-violent adult arrests, reducing discretionary variability and improving transparency.
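As a rough check on whether a gap of this size could arise by chance, a two-proportion z-test can be run on the reported figures. This Python sketch is an addition for illustration, not part of the original analysis. It assumes that arrests are independent of one another, which may not hold if the same person is arrested more than once, and it inherits the assumption that missing age marks a minor.

```python
import math


def two_proportion_z(p1, n1, p2, n2):
    """z statistic for the difference between two independent proportions."""
    pooled = (p1 * n1 + p2 * n2) / (n1 + n2)
    standard_error = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    return (p1 - p2) / standard_error


# Reported figures: 95.4% of 1,843 underage arrests vs. 92.4% of 35,317 others.
z = two_proportion_z(0.954, 1843, 0.924, 35317)
```

With these inputs z comes out at roughly 4.8, well beyond the conventional 1.96 cutoff, so under the stated assumptions the difference in clearance rates is unlikely to be due to chance alone. The practical size of the gap (three percentage points) remains modest.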
We explored two central questions: (1) “Can we predict the hour of day when arrests are most likely to occur in Chapel Hill based on features such as location, time, and demographic factors?” and (2) “Are underage individuals treated differently than adults in terms of arrest type and case disposition?”
For the first question, we developed multiple predictive models, ultimately finding that a Random Forest model using a circular transformation of time (via sine and cosine) achieved the lowest MAE, although its RMSE was higher than that of the best linear-time model. Some limitations remained, particularly around late-night hours, but the circular approach improved residual patterns and reduced bias near midnight. By identifying consistent patterns in when and where arrests occur, police departments can allocate officers more effectively around peak hours, such as late nights on weekends near Franklin Street. Our models found that features such as semester status and day of the week meaningfully improved prediction accuracy, suggesting that events tied to the academic calendar influence arrest timing. This opens conversations for university administrators to schedule targeted safety communications, coordinate mental health resources, or determine necessary campus patrol procedures. Rather than reacting to incidents, both law enforcement and university leadership can proactively plan around predictable arrest patterns, with the goal of improving public safety while reducing unnecessary interventions.
In the second part of our project, we used missing age fields as a proxy for underage status, on the premise that North Carolina law shields juvenile information. We analyzed two dimensions of differential treatment: arrest type and case disposition. We found that these individuals were more often cited than detained and were slightly more likely to have their cases formally cleared. Our results suggest that law enforcement treats minors differently, likely as a result of both juvenile justice policy and officer decision-making practices aimed at minimizing harm and legal complexity in youth cases. Working with indirect data poses a challenge, but we uncovered consistent trends in how youth are processed. For policymakers, even though juvenile records are shielded from public access, patterns in redacted data, such as clusters of missing age fields and citation-heavy arrest types, still offer insight into how minors are policed and processed. This could inform future reforms, such as clearer standards for citations versus detainments. The findings also highlight where youth policing is concentrated, raising questions about whether certain locations are disproportionately represented. For families and educators, understanding how youth are handled by law enforcement can help support community outreach, legal education, and effective policing practices.
Our findings highlight how publicly available arrest records can provide valuable insights into the timing and nature of local law enforcement practices. Future research should incorporate variables such as campus events, patrol routes, arrest severity, recidivism rates, or social event data to enhance the accuracy of the predictive model. Exploring equity across intersections of race, gender, and the socio-spatial context of arrests would also help ensure ethical and responsible use of the data. Additionally, gaining access to more detailed disposition data and complete age information would enable a more nuanced understanding of how policing practices and the law affect youth populations. Our dual analysis highlights both the when and the how of arrests in Chapel Hill. By combining machine learning and policy analysis, we offer insights that have the potential to inform policy, enhance safety strategies, and foster discussions about equity and privacy in public policing.